Abstract

We study, with the replica method, the role played by dilution in the average behavior of a perceptron model with continuous couplings. We analyze the stability of the replica symmetric solution as a function of the dilution field for the generalization and memorization problems. Thanks to a Gardner-like stability analysis we show that at any fixed ratio $\alpha$ between the number of patterns $M$ and the dimension $N$ of the perceptron ($\alpha = M/N$), there exists a critical dilution field $h_c$ above which the replica symmetric ansatz becomes unstable.
INTRODUCTION

Neural networks [1] have become among the most studied and successful models in the field of artificial intelligence. In spite of more than 50 years of research, the field is still thriving. In recent years, thanks to the advances of new-generation high-throughput technologies in molecular biology, the field has experienced renewed interest, with problems coming from the analysis of high dimensional data in biology, where sparsity in the underlying model is a general feature.

In this paper we address the issue of the stability of the replica symmetric solution presented in [9], which has important implications for the implementation of many algorithmic strategies. This paper thus extends the analysis presented in [9] and generalizes the stability results of [10] to the external dilution case.

A standard problem in artificial intelligence is that of generalization [10], [11]. A set of $M$ patterns $x^\mu$ is given together with a binary output variable $y^\mu$ for each of them ($\mu \in \{1, 2, \dots, M\}$). A pattern $x$ is an $N$-dimensional vector and we are interested in learning the hidden relation among its components and the classification value $y^\mu$ related to pattern $\mu$. In the simplest setup, a linear perceptron tries to encode such a relation in a vector $J$ (called student) such that the binary classification $y^\mu = \pm 1$ is reproduced by

$$y^\mu = \mathrm{sign}(J \cdot x^\mu).$$

We will consider that the classification is actually generated by an unknown teacher $J^0$,

$$y^\mu = \mathrm{sign}(J^0 \cdot x^\mu + \eta^\mu), \qquad (1)$$

and therefore the aim of the student $J$ is to be as close as possible to the teacher $J^0$. A Gaussian noise $\eta^\mu \sim \mathcal{N}(0, \gamma^2)$ is added to the classification function to account for experimental noise in the data. Two interesting limits are studied: no noise, $\gamma = 0$, and random classification, $\gamma = \infty$.

As a first step to analyze the problem, we define an energy function counting the number of patterns that are wrongly classified by the student $J$:

$$E(J) = \sum_{\mu=1}^{M} \Theta(-y^\mu J \cdot x^\mu). \qquad (2)$$

In the following we will consider only the case $J \in \mathbb{R}^N$. Since the energy depends only on the direction of $J$ and not on its length, i.e. $E(cJ) = E(J)$ for all $c > 0$, we restrict ourselves to the surface of the sphere $\sum_{i=1}^{N} J_i^2 = N$.
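The scale invariance used above can be checked numerically. The following is a minimal sketch, assuming i.i.d. Gaussian patterns and random labels purely for illustration (the pattern distribution is not specified in the text); the function names `energy` and `project_to_sphere` are hypothetical.

```python
import random

def energy(J, patterns, labels):
    """Number of patterns misclassified by the student J, as in Eq. (2)."""
    errors = 0
    for x, y in zip(patterns, labels):
        field = sum(Ji * xi for Ji, xi in zip(J, x))
        # Theta(-y * J.x): an error is counted when sign(J.x) disagrees with y
        if -y * field > 0:
            errors += 1
    return errors

def project_to_sphere(J):
    """Rescale J by a positive factor so that sum_i J_i^2 = N."""
    N = len(J)
    norm2 = sum(Ji * Ji for Ji in J)
    scale = (N / norm2) ** 0.5
    return [Ji * scale for Ji in J]

# Illustrative data (assumption: Gaussian patterns, random +/-1 labels)
rng = random.Random(0)
N, M = 20, 30
patterns = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]
labels = [rng.choice([-1, 1]) for _ in range(M)]
J = [rng.gauss(0, 1) for _ in range(N)]
J_sphere = project_to_sphere(J)

# E(cJ) = E(J) for c > 0, so the projection leaves the energy unchanged
print(energy(J, patterns, labels), energy(J_sphere, patterns, labels))
```

Because the projection multiplies $J$ by a positive constant, the two printed energies coincide, which is why restricting to the sphere loses no generality.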



Standard statistical mechanics methods [12-14] have been widely used to study the thermodynamic properties of this problem. In a recent work [9] a slightly different point of view has been analyzed: which is the $J$ that minimizes the energy function with the largest possible number of coordinates equal to zero or, in other words, which is the sparsest possible $J$ compatible with a correct classification? To address this issue one needs to study the performance of the perceptron in the presence of an external dilution field $h$ coupled to the classification vector $J$. To this end one can consider the following Hamiltonian:

$$\beta H(J) = \beta E(J) + h \|J\|_p, \qquad (3)$$

where the dilution field $h$, in the last term, acts as a chemical potential on $J$. Two interesting cases were studied at the replica symmetric level: the $L_1$ norm ($\|J\|_1 = \sum_{i=1}^{N} |J_i|$) and the $L_0$ norm ($\|J\|_0 = \lim_{p \to 0} \sum_{i=1}^{N} |J_i|^p = \sum_{i=1}^{N} (1 - \delta_{J_i,0})$).
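As a small illustration of the two norms, the sketch below evaluates $\sum_i |J_i|^p$ and shows that for small $p$ it approaches the number of non-zero components. The function name `lp_norm` and the example vector are illustrative choices, not taken from [9].

```python
def lp_norm(J, p):
    """sum_i |J_i|^p; for p = 0 this is taken as the number of non-zero entries."""
    if p == 0:
        return sum(1 for Ji in J if Ji != 0)
    return sum(abs(Ji) ** p for Ji in J)

J = [0.0, 1.5, -0.2, 0.0, 3.0]

print(lp_norm(J, 1))      # L1 norm: sum of absolute values
print(lp_norm(J, 0))      # L0 norm: count of non-zero components
print(lp_norm(J, 1e-6))   # small p: close to the L0 count
```

The $L_0$ term therefore penalizes each non-zero coupling equally, regardless of its magnitude, while the $L_1$ term penalizes large couplings more.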

Of course this problem has practical interest when one knows a priori that the teacher is actually sparse. In many real life problems the multidimensional patterns $x$ contain irrelevant information, in the sense that only a few of the components are actually considered in the classification process. We will consider that only a fraction $n^0_{\mathrm{eff}}$ of the teacher's components are non zero. Therefore each teacher's component is extracted independently from the distribution

$$\rho(J^0) = (1 - n^0_{\mathrm{eff}})\, \delta(J^0) + n^0_{\mathrm{eff}}\, \rho'(J^0). \qquad (4)$$

It has been shown that dilution with the $L_p$ norm, with $p < 1$, forces a fraction of the components of $J$ to be exactly zero, and therefore allows a selection of the components that actively participate in the classification process.
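The teacher-student setup of Eqs. (1) and (4) can be sketched as follows. This is only an illustration under stated assumptions: the non-zero part $\rho'$ of the teacher distribution is taken to be a standard Gaussian and the patterns are drawn i.i.d. Gaussian, neither of which is fixed by the text; `sample_teacher` and `sample_data` are hypothetical helper names.

```python
import random

def sample_teacher(N, n_eff, rng):
    """Each component is 0 with probability 1 - n_eff, otherwise drawn from rho'.
    Assumption: rho' is a standard Gaussian."""
    return [rng.gauss(0, 1) if rng.random() < n_eff else 0.0 for _ in range(N)]

def sample_data(teacher, M, gamma, rng):
    """Patterns x and labels y = sign(J0 . x + eta), eta ~ N(0, gamma^2).
    Assumption: pattern components are i.i.d. standard Gaussians."""
    N = len(teacher)
    patterns, labels = [], []
    for _ in range(M):
        x = [rng.gauss(0, 1) for _ in range(N)]
        field = sum(J0i * xi for J0i, xi in zip(teacher, x))
        eta = rng.gauss(0, gamma) if gamma > 0 else 0.0  # gamma = 0: noiseless rule
        patterns.append(x)
        labels.append(1 if field + eta >= 0 else -1)
    return patterns, labels

rng = random.Random(1)
teacher = sample_teacher(N=200, n_eff=0.05, rng=rng)          # 95% sparse teacher
patterns, labels = sample_data(teacher, M=100, gamma=0.0, rng=rng)  # alpha = 0.5
print(sum(1 for t in teacher if t != 0.0), "non-zero teacher components")
```

With `gamma=0.0` the labels follow the teacher exactly (generalization); a very large `gamma` makes them essentially random (memorization).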

The cost function $H(J)$ raises severe computational problems due to the scale invariance of the energy, which makes the optimization problem non-convex in general. At odds with problems like compressed sensing, for which optimization methods of order $O(N^3)$ can be used for the $L_1$ norm [5], whereas the most effective $L_0$ norm [5] makes the minimization an NP-hard optimization problem, in our perceptron-like case we are not aware of any ad-hoc numerical technique for efficiently finding the minimum of the cost function in Eq. (3), apart from Monte Carlo based optimization methods.
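To make the last remark concrete, here is a minimal sketch of a Monte Carlo search for low values of the cost in Eq. (3) on the sphere. It is a heuristic under stated assumptions, not the procedure of [9]: the move set (zeroing or perturbing one coupling, then rescaling) is an illustrative choice and is not symmetric, so this is a stochastic optimizer rather than an exact sampler of the Gibbs measure. All names and parameter values are hypothetical.

```python
import math
import random

def cost(J, patterns, labels, beta, h, p):
    """beta * E(J) + h * sum_i |J_i|^p (zero entries contribute nothing)."""
    errors = sum(
        1 for x, y in zip(patterns, labels)
        if y * sum(Ji * xi for Ji, xi in zip(J, x)) < 0
    )
    norm = sum(abs(Ji) ** p for Ji in J if Ji != 0.0)
    return beta * errors + h * norm

def rescale(J):
    """Project back onto the sphere sum_i J_i^2 = N."""
    s = math.sqrt(len(J) / sum(Ji * Ji for Ji in J))
    return [Ji * s for Ji in J]

def metropolis(patterns, labels, beta, h, p, steps, rng):
    N = len(patterns[0])
    J = rescale([rng.gauss(0, 1) for _ in range(N)])
    c = cost(J, patterns, labels, beta, h, p)
    best_J, best_c = J, c
    for _ in range(steps):
        trial = list(J)
        i = rng.randrange(N)
        # Either zero one coupling (a dilution move) or perturb it slightly.
        trial[i] = 0.0 if rng.random() < 0.5 else trial[i] + rng.gauss(0, 0.5)
        if all(t == 0.0 for t in trial):
            continue  # the zero vector cannot be projected onto the sphere
        trial = rescale(trial)
        c_trial = cost(trial, patterns, labels, beta, h, p)
        # Accept downhill moves always, uphill moves with probability exp(-delta).
        if c_trial <= c or rng.random() < math.exp(c - c_trial):
            J, c = trial, c_trial
            if c < best_c:
                best_J, best_c = J, c
    return best_J, best_c

# Toy instance (assumption: Gaussian patterns, noiseless sparse teacher)
rng = random.Random(0)
N, M = 10, 20
teacher = [1.0 if i < 2 else 0.0 for i in range(N)]
patterns = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(M)]
labels = [1 if sum(t * xi for t, xi in zip(teacher, x)) >= 0 else -1 for x in patterns]
best_J, best_c = metropolis(patterns, labels, beta=5.0, h=0.1, p=1.0, steps=2000, rng=rng)
print(best_c, sum(1 for Ji in best_J if Ji != 0.0))
```

The explicit zeroing move is what allows exactly sparse students to be reached; a purely continuous perturbation would almost never land on a zero coupling.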



From a theoretical perspective, we are interested in the average case properties of the structure of the solution space in both cases, $L_1$ and $L_0$. The paper is organized in the following way: the next section (I) fixes the notation and sketches the replica symmetric results presented in [9]; in section II we study the stability of this replica symmetric solution; in section III we show that at $h = 0$ the replica symmetric solution is always stable, while at $h \to \infty$ it is always unstable, and therefore a critical value of the dilution field marks the threshold between two phases, computed in section IV. Finally, our results are summarized in section V and the main results of [9] are re-evaluated in the present context.



I. REPLICA SYMMETRIC SOLUTION TO DILUTED PERCEPTRON



Using the replica method [12] one can compute the average free energy as the limit $-\beta f = \lim_{n \to 0} \lim_{N \to \infty} \log \overline{Z^n} / (nN)$, where the overline denotes the average over the patterns. The replica symmetric (RS) ansatz assumes that any two replicas of the system have the same overlap, and therefore the corresponding overlap matrix and its Fourier conjugate assume the following structure:

(5)



These are symmetric $(n + 1) \times (n + 1)$ matrices where the first column (row) contains the teacher's parameters. For instance, $t$ is the variance of the teacher and is related to the distribution (4), and is not a variational parameter. On the other hand, $r = \langle \frac{1}{N} J \cdot J^0 \rangle$ is the average overlap between the teacher and the student, and $q = \langle \frac{1}{N} J^a \cdot J^b \rangle$ is the average overlap between two replicas (see [9] for details).

The correct value for the free energy is obtained by minimizing the variational RS free energy

$$-\beta f = -r \hat{r} + \frac{1}{2} q \hat{q} - \lambda + G_J + \alpha G_X$$

with respect to the variational parameters $(q, r, \hat{q}, \hat{r}, \lambda)$, where

$$G_J = \int Dx \int dJ^0\, \rho(J^0) \log \int dJ\, e^{-(\hat{q}/2 - \lambda) J^2 - h |J|^p + (\hat{r} J^0 - \sqrt{\hat{q}}\, x) J}. \qquad (6)$$

In these equations, and from now on, $Dx = dx\, \exp(-x^2/2)/\sqrt{2\pi}$ is a Gaussian measure in $x$, and the function $H(y) = \int_y^{\infty} Dx$.
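For numerical work, the Gaussian tail function $H(y)$ can be written through the complementary error function, $H(y) = \tfrac{1}{2}\,\mathrm{erfc}(y/\sqrt{2})$. The sketch below evaluates it that way and compares it against a direct trapezoidal integration of the Gaussian measure; the helper names and the truncation of the integral at `upper=10.0` are illustrative assumptions.

```python
import math

def gaussian_density(x):
    """Weight of the Gaussian measure Dx = dx exp(-x^2/2) / sqrt(2 pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def H(y):
    """H(y) = int_y^inf Dx, written with the complementary error function."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))

def H_numeric(y, upper=10.0, n=20000):
    """Trapezoidal estimate of the same integral, truncated at `upper`."""
    step = (upper - y) / n
    total = 0.5 * (gaussian_density(y) + gaussian_density(upper))
    total += sum(gaussian_density(y + k * step) for k in range(1, n))
    return total * step

print(H(0.0))                   # half of the Gaussian weight lies above 0
print(H(1.0), H_numeric(1.0))   # closed form vs. direct integration
```

The `erfc` form stays accurate for large positive $y$, where computing $1 - \Phi(y)$ from the cumulative distribution would lose precision.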

A key role is played by $\alpha = M/N$, the ratio between the number of patterns $M$ and the space dimensionality $N$, which measures the amount of information we have and, together with $h$, is the fundamental parameter controlling the quality of the generalization. We assume that in the limit $N \to \infty$, $\alpha$ remains finite.



II. STABILITY A LA GARDNER 



The main goal of this work is to analyze the stability of the replica symmetric solution. Other structures for the overlap matrices (5) are indeed possible. In analogy with the organization of the thermodynamic states in the low temperature phase of mean field spin glass-like models [16], we expect that the space of solutions of our model could be described by a hierarchical (ultrametric) organization of replicas or, in other terms, that our system might undergo a spontaneous replica symmetry breaking (RSB). From a purely geometric point of view this process is related to the fragmentation of the space of zero energy configurations into disconnected regions or, equivalently, the fragmentation of the Gibbs measure into many pure states (Fig. 1).




FIG. 1. Representation of a connected (left) and a disconnected (right) solution space on the sphere. A connected phase space is typical of replica symmetric solutions, while the breaking of the replica symmetry implies the clustering of the space.

We will now apply Gardner's method [13] for the stability analysis of the replica symmetric solution, studying the eigenvalues of the Hessian matrix:



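$$\begin{pmatrix}
\left\| \dfrac{\partial^2 (\beta n f_n)}{\partial Q_{ab}\, \partial Q_{cd}} \right\| & \left\| \dfrac{\partial^2 (\beta n f_n)}{\partial Q_{ab}\, \partial(-i\hat{Q}_{cd})} \right\| \\[2ex]
\left\| \dfrac{\partial^2 (\beta n f_n)}{\partial Q_{ab}\, \partial(-i\hat{Q}_{cd})} \right\| & \left\| \dfrac{\partial^2 (\beta n f_n)}{\partial(-i\hat{Q}_{ab})\, \partial(-i\hat{Q}_{cd})} \right\|
\end{pmatrix} \qquad (7)$$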



evaluated at the RS fixed point. Details of the calculation are given in Appendix A. The eigenvalues of the $n(n + 1) \times n(n + 1)$ Hessian matrix are simply related to the spectrum of the four sub-matrices. Among the eigenvectors, only those with eigenvalues $\delta_1$ and $\delta_2$ signal an instability of the replica symmetric solution:

$$\delta_1 \delta_2 = \gamma_1 \gamma_2 - 1 \qquad (8)$$

where $\gamma_1$ and $\gamma_2$ are eigenvalues of the inner matrices $\left\| \partial^2 (\beta n f_n) / \partial Q_{ab}\, \partial Q_{cd} \right\|$ and $\left\| \partial^2 (\beta n f_n) / \partial(-i\hat{Q}_{ab})\, \partial(-i\hat{Q}_{cd}) \right\|$, given by

/ \ ( / -'3. \ 2 

2a ( / yr \ I / e 2 \ hoe 2 



71 = - T , « / D ?/^ 



1 " ?) 2 7 1 v 7 ^ 2 + qt - r 2 / I V2^{H{h ) - 1) I V^(H(h ) - 1) 



where $H_J = -(\hat{q}/2 - \lambda) J^2 + (\hat{r} J^0 - \sqrt{\hat{q}}\, x) J - h |J|^p$.

The sign of the product $\delta_1 \delta_2$ determines the stability of the extremal point: when $\delta_1 \delta_2 < 0$ the point is stable, and it is unstable otherwise.



III. EXTREME CASES: $h = 0$ AND $h \to \infty$



In the zero temperature case, the Gibbs measure concentrates over the states of lowest energy, i.e. over those vectors $J$ that correctly classify most patterns. In the absence of dilution ($h = 0$), the measure is uniform over these states, and the free energy is thus a measure of their entropy (we are working at zero temperature). At the other extreme, we have the high dilution limit $h \to \infty$ that we studied in [9] at the replica symmetric level. Next we check the stability of the replica symmetric solution for both the memorization and generalization problems in these two extreme situations. As we will see, the replica symmetric solution is stable for $h = 0$ and unstable for $h \to \infty$, and therefore there is a critical value $h_c(\alpha)$ dividing the RS phase from the non-RS one.



A. Stability of non diluted perceptron $h = 0$



In the absence of dilution ($h = 0$) we still have two cases: memorization ($\gamma \to \infty$) and generalization ($\gamma = 0$). In the high noise limit $\gamma \to \infty$ (see Eq. (1)) there is no rule to infer because the experiments are randomly classified, and the perceptron tries to memorize the output variable. In the $\gamma = 0$ case, the patterns are classified according to the teacher and this rule can be generalized.

In the memorization limit ($\gamma \to \infty$), an RS solution of zero energy always exists for $0 < \alpha < 2$ [13]. For these values of $\alpha$ the student is capable of memorizing the experiments, whereas for $\alpha > 2$ there are no zero energy solutions, and solutions are not replica symmetric. In agreement with this known behavior (see [13]), the product of the eigenvalues $\delta_1 \delta_2$, evaluated at the non diluted memorization replica symmetric solution, is negative for $\alpha < 2$. Its value grows from $-1$ at $\alpha = 0$ to $0$ at $\alpha = 2$ (Figure 2), an indication that the zero energy replica symmetric solution is stable in this range.



FIG. 2. The eigenvalues product (left) evaluated in the replica symmetric solution for the non diluted generalization and memorization cases. The corresponding self-overlap $q$ of the student (right).



In the generalization limit ($\gamma = 0$) the perceptron can asymptotically learn the classification rule provided a large enough amount of data is given, as shown in [9]. However, for any finite amount of information $\alpha$, there is not just one zero-cost student, but a continuum of them filling a bounded region of the $J$ space (see left panel of Fig. 1). An indication of the size of the solution space is given by the student-student overlap $q = \langle J^a J^b \rangle$. As in the memorization case (for $\alpha < 2$), the solution space is connected and, consistently, in this case the product $\delta_1 \delta_2$ always remains negative (see Fig. 2), which means that the replica symmetric solution is stable for every $\alpha$, as in [10]. As $\alpha \to \infty$ the solution space shrinks around the correct value $J^0$, and $q \to 1$ (see Fig. 2).

As expected, when little information is given to the student (low values of $\alpha$), there is little difference between memorization and generalization. Figure 2 shows that the stability, given by the product $\delta_1 \delta_2$, also coincides for the non diluted generalization and memorization cases at small $\alpha$, with the same limit for $\alpha \to 0^+$.



B. Stability of diluted perceptron $h = \infty$



The dilution field selects those solutions with the lowest values of the norm. In particular, the $L_0$ and $L_1$ norms are known to force the sparsity of the solutions, pushing a fraction of the student's components to be zero. Seeking sparse solutions can enhance the efficiency in the use of the available information, in particular when the teacher is actually sparse.

In the following we will consider a 95% sparse teacher, i.e. $n^0_{\mathrm{eff}} = 0.05$ in Eq. (4).

The behavior in the large dilution field limit $h \to \infty$ depends on the norm used, and so does the learning behavior of the perceptron [9]. We restrict ourselves to the study of the space of $E = 0$ solutions: to do so we take the limit $T \to 0$ first, and then the limit $h \to \infty$. Again in [9] it was shown that, in the replica symmetric case, $h \to \infty$ pushes $q \to 1$ at any $\alpha$, concentrating the Gibbs measure over a single point: the most diluted zero energy $J$. However, the question remains as to whether the replica symmetric ansatz is still valid or the strong dilution field fractures the Gibbs measure into disconnected components.

In figure 3 we show the eigenvalues product for the $L_0$ and $L_1$ dilutions in the $h \to \infty$ limit for both the memorization and the generalization cases. At variance with the non diluted case, the $h = \infty$ one is always unstable.



FIG. 3. The eigenvalues product evaluated in the replica symmetric solution for the $L_0$ and the $L_1$ norms in generalization and memorization.

IV. PHASE DIAGRAM

In the region of $\alpha$ where perfect memorization/generalization is possible, the replica symmetric ansatz is stable in the absence of dilution and unstable in the presence of a very strong dilution field $h \to \infty$. Therefore there should exist a critical value of the dilution field $h_c(\alpha)$ separating the RS from the RSB phase. In figure 4 we show $h_c(\alpha)$ for the $L_1$ and the $p = 0.1$ norms applied to models with different teacher dilutions.
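Since the eigenvalue product is negative at $h = 0$ and positive at large $h$, a critical field at fixed $\alpha$ can be located by bisection on its sign. The following is a minimal sketch of that idea, assuming a single sign change in the bracket; `toy_eig_product` is a hypothetical stand-in used only to exercise the routine and is NOT the expression for $\delta_1 \delta_2$ derived in this paper, which would additionally require solving the RS saddle-point equations at each $(\alpha, h)$.

```python
def critical_field(eig_product, alpha, h_max=1e3, tol=1e-8):
    """Bisection for the h at which eig_product(alpha, h) crosses zero.
    Assumes the product is negative (RS stable) at h = 0, positive at h_max,
    and changes sign only once in between."""
    lo, hi = 0.0, h_max
    if eig_product(alpha, lo) >= 0 or eig_product(alpha, hi) <= 0:
        raise ValueError("no sign change bracketed in [0, h_max]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eig_product(alpha, mid) < 0:
            lo = mid  # still RS stable: the critical field lies above mid
        else:
            hi = mid  # already unstable: the critical field lies below mid
    return 0.5 * (lo + hi)

def toy_eig_product(alpha, h):
    """Hypothetical stand-in, not the paper's formula: equals -1 at h = 0
    and crosses zero at h = 1 + alpha."""
    return h / (1.0 + alpha) - 1.0

print(critical_field(toy_eig_product, alpha=0.5))
```

Repeating the search over a grid of $\alpha$ values would trace out a critical line of the kind shown in the phase diagram.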



FIG. 4. The phase diagram for the $L_1$ (left) and the $p = 0.1$ (right) norms for several values of the teacher's dilution $n^0_{\mathrm{eff}}$. The replica symmetric solution is stable for values of $\alpha$ and $h$ below the lines.

The stability curves are similar near $\alpha = 0$ because with so little information there is no big difference between memorizing and generalizing. The curves separate around $\alpha > 0.4$. The value of the critical dilution field increases with $\alpha$. This can be rationalized as follows: more information implies a reduction in the size of the $E = 0$ solution space, and therefore different solutions are closer, so a higher dilution field is needed to cluster the set of zero energy students. In generalization, when $\alpha \to \infty$ the solution space is composed of a single vector, which is precisely the teacher, and the replica-replica overlap $q$ achieves its maximum value 1; consequently the critical curves go to infinity too. For the memorization case, the value $q = 1$ is achieved for $\alpha = 2$, which is why the critical curve diverges there.

Drawing the critical lines for the $L_0$ norm is not obvious because setting $p = 0$ in the equations, before taking the strong dilution limit $h \to \infty$, eliminates the effect of any finite dilution field, so there is no way to find $h_c$ (see [9]). What we have done instead is to study the critical lines for a small enough value of the norm exponent $p$. The results using $p = 0.1$ are shown in the right panel of Fig. 4. Again, around $\alpha = 0$ both memorization and generalization behave similarly. The memorization critical line also diverges when $\alpha \to 2$, while the generalization one increases with $\alpha$ as for the $L_1$ norm. In both cases, $L_1$ and $L_{0.1}$, the critical field $h_c(\alpha)$ depends on the sparsity $n^0_{\mathrm{eff}}$ of the teacher (Eq. (4)).



V. CONCLUSIONS 

We have performed a full stability analysis of the replica symmetric solution for the non diluted and diluted generalization and memorization problems for a learning perceptron. Imposing a dilution should improve the results obtained in real algorithms, but comes at a price. We showed that even in the satisfiable phase (where zero cost solutions exist) an infinite dilution breaks the symmetry of the replicas, whereas no dilution at all keeps the replica symmetric solution stable. The breaking of the symmetry is usually connected with convergence problems in algorithms like belief propagation. Yet, it is always possible to find a dilution field weak enough for the replica symmetric solution to be stable, provided it exists. The critical dilution field $h_c(\alpha)$ depends on the actual sparseness of the teacher.



Figure 5, partially taken from [9], shows the generalization error achieved in the average case by a perceptron without a dilution field, and with a very strong ($h \to \infty$) dilution field. The teacher used is quite sparse ($n^0_{\mathrm{eff}} = 5\%$), and therefore the learning process is enhanced by the use of dilution. Restricting ourselves to the replica symmetric space, the $h \to \infty$ limit is not achievable, and $h_c(\alpha)$ is the best we can do. The gain in accuracy (lower error) obtained with an $h_c$-diluted perceptron is not as impressive as in the $h \to \infty$ case, but it still improves on the results without dilution.

The present work is useful as it defines the phase space of the replica symmetric solution in the information-dilution ($\alpha$-$h$) plane. The actual relation of these predictions to the behavior of algorithms remains an open project for the future. Also left for the future is the use of 1RSB (or higher) parameters to make average predictions in the RSB phase.

FIG. 5. The first three lines (in caption order) are the generalization error obtained without dilution and with strong $h \to \infty$ dilution for the $L_1$ and $L_0$ norms (results taken from [9]). The last two lines correspond to the generalization error obtained with $h = h_c(\alpha)$ in the $L_1$ and $L_{0.1}$ cases.
Appendix A: Details of the stability calculation

The variational free energy in terms of all the $Q_{ab}$ and their Fourier counterparts $\hat{Q}_{ab}$ is (see [9]), after taking the $N \to \infty$ limit, given by



Pnf n 



' 2_j QabQab ~ 
a<b 

log /fl^ a / dJ p(J°)e- h Z«m-^ a 



J a J b 



-a log / Drj Yl 



dx a da; a 



a=0 



Differentiating with respect to its $(n + 1)(n + 2)$ parameters $Q_{ab}$ and $\hat{Q}_{ab}$, one can construct the matrix of second derivatives, or Hessian, of $\beta n f_n$. The sign of the eigenvalues of this matrix evaluated at an extremal point gives all the information about its stability. The free energy second derivatives are

°' [ ' h!j " ' -a(l- — )(1 - — ){{x a x b ) (£ c x d ) - (£ a x b £ c x d )) 



dQabdQcd 2 jy 2 

(J a J b J c J d ) - (J a J b )(J c J d ) 



5(-zQa6)<9(-zQ C(i ) 

dQabd(-iQcd) 



S, 



ab,cd 



where 



(<?(x\x 6 ,...)) 



/D77 n"-0 d ffS a ff(x",X fo , „.) e -^E2=l 9 (- !Btt ( ai °+ 7 ?))+*Ea=l^ a -|Ea,6 0ai.^ 



-hT,a\\J a \\l-^a< b Qa b J a J h 



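The structure of the Hessian matrix is:

$$\begin{pmatrix}
\left\| \dfrac{\partial^2 (\beta n f_n)}{\partial Q_{ab}\, \partial Q_{cd}} \right\| & \left\| \dfrac{\partial^2 (\beta n f_n)}{\partial Q_{ab}\, \partial(-i\hat{Q}_{cd})} \right\| \\[2ex]
\left\| \dfrac{\partial^2 (\beta n f_n)}{\partial Q_{ab}\, \partial(-i\hat{Q}_{cd})} \right\| & \left\| \dfrac{\partial^2 (\beta n f_n)}{\partial(-i\hat{Q}_{ab})\, \partial(-i\hat{Q}_{cd})} \right\|
\end{pmatrix} \qquad (A1)$$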



The eigenvector space of the Hessian matrix can be divided in two: one subspace corresponding to instabilities inside the replica symmetric space (whose eigenvalue signs determine which, among all the possible replica symmetric solutions, actually minimizes the free energy), and the other corresponding to instabilities that take our extremal point outside the replica symmetric space. The first of these subspaces is of no interest to our stability analysis, because solving the replica symmetric fixed point equations is equivalent to searching for the stable extremal replica symmetric point, so we focus on the eigenvalues of the eigenvectors belonging to the second subspace. A detailed study of the Hessian matrix shows that it has just two non replica symmetric eigenvalues, $\delta_1$ and $\delta_2$. The product $\delta_1 \delta_2$ can be expressed as in Eq. (8) in terms of the eigenvalues of the inner matrices.